{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training Code\n",
    "Here is the code for training the model. `fname` is a file to read the characters from. `order` is the history size to consult. Note that we pad the data with leading `~` so that we also learn how to start.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from collections import *\n",
    "\n",
    "def train_char_lm(fname, order=4):\n",
    "    data = file(fname).read()\n",
    "    lm = defaultdict(Counter)\n",
    "    pad = \"~\" * order\n",
    "    data = pad + data\n",
    "    for i in xrange(len(data)-order):\n",
    "        history, char = data[i:i+order], data[i+order]\n",
    "        lm[history][char]+=1\n",
    "    def normalize(counter):\n",
    "        s = float(sum(counter.values()))\n",
    "        return [(c,cnt/s) for c,cnt in counter.iteritems()]\n",
    "    outlm = {hist:normalize(chars) for hist, chars in lm.iteritems()}\n",
    "    return outlm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's train it on Andrej's Shakespears's text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2017-02-01 11:33:34--  http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt\n",
      "Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64\n",
      "Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:80... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 4573338 (4.4M) [text/plain]\n",
      "Saving to: ‘shakespeare_input.txt’\n",
      "\n",
      "shakespeare_input.t 100%[===================>]   4.36M  12.8MB/s    in 0.3s    \n",
      "\n",
      "2017-02-01 11:33:34 (12.8 MB/s) - ‘shakespeare_input.txt’ saved [4573338/4573338]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "lm = train_char_lm(\"shakespeare_input.txt\", order=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ok. Now let's do some queries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('!', 0.0068143100511073255),\n",
       " (' ', 0.013628620102214651),\n",
       " (\"'\", 0.017035775127768313),\n",
       " (',', 0.027257240204429302),\n",
       " ('.', 0.0068143100511073255),\n",
       " ('r', 0.059625212947189095),\n",
       " ('u', 0.03747870528109029),\n",
       " ('w', 0.817717206132879),\n",
       " ('n', 0.0017035775127768314),\n",
       " (':', 0.005110732538330494),\n",
       " ('?', 0.0068143100511073255)]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lm['ello']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('t', 1.0)]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lm['Firs']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(\"'\", 0.0008025682182985554),\n",
       " ('A', 0.0056179775280898875),\n",
       " ('C', 0.09550561797752809),\n",
       " ('B', 0.009630818619582664),\n",
       " ('E', 0.0016051364365971107),\n",
       " ('D', 0.0032102728731942215),\n",
       " ('G', 0.0898876404494382),\n",
       " ('F', 0.012038523274478331),\n",
       " ('I', 0.009630818619582664),\n",
       " ('H', 0.0040128410914927765),\n",
       " ('K', 0.008025682182985553),\n",
       " ('M', 0.0593900481540931),\n",
       " ('L', 0.10674157303370786),\n",
       " ('O', 0.018459069020866775),\n",
       " ('N', 0.0008025682182985554),\n",
       " ('P', 0.014446227929373997),\n",
       " ('S', 0.16292134831460675),\n",
       " ('R', 0.0008025682182985554),\n",
       " ('T', 0.0032102728731942215),\n",
       " ('W', 0.033707865168539325),\n",
       " ('a', 0.02247191011235955),\n",
       " ('c', 0.012841091492776886),\n",
       " ('b', 0.024879614767255216),\n",
       " ('e', 0.0032102728731942215),\n",
       " ('d', 0.015248796147672551),\n",
       " ('g', 0.011235955056179775),\n",
       " ('f', 0.011235955056179775),\n",
       " ('i', 0.016853932584269662),\n",
       " ('h', 0.019261637239165328),\n",
       " ('k', 0.0040128410914927765),\n",
       " ('m', 0.02247191011235955),\n",
       " ('l', 0.01043338683788122),\n",
       " ('o', 0.030497592295345103),\n",
       " ('n', 0.020064205457463884),\n",
       " ('q', 0.0016051364365971107),\n",
       " ('p', 0.00882825040128411),\n",
       " ('s', 0.03290529695024077),\n",
       " ('r', 0.0072231139646869984),\n",
       " ('u', 0.0016051364365971107),\n",
       " ('t', 0.05377207062600321),\n",
       " ('w', 0.024077046548956663),\n",
       " ('v', 0.002407704654895666),\n",
       " ('y', 0.002407704654895666)]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lm['rst ']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So `ello` is followed by either space, punctuation or `w` (or `r`, `u`, `n`), `Firs` is pretty much deterministic, and the word following `ist ` can start with pretty much every letter."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generating from the model\n",
    "Generating is also very simple. To generate a letter, we will take the history, look at the last $order$ characteters, and then sample a random letter based on the corresponding distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from random import random\n",
    "\n",
    "def generate_letter(lm, history, order):\n",
    "        history = history[-order:]\n",
    "        dist = lm[history]\n",
    "        x = random()\n",
    "        for c,v in dist:\n",
    "            x = x - v\n",
    "            if x <= 0: return c"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To generate a passage of $k$ characters, we just seed it with the initial history and run letter generation in a loop, updating the history at each turn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def generate_text(lm, order, nletters=1000):\n",
    "    history = \"~\" * order\n",
    "    out = []\n",
    "    for i in xrange(nletters):\n",
    "        c = generate_letter(lm, history, order)\n",
    "        history = history[-order:] + c\n",
    "        out.append(c)\n",
    "    return \"\".join(out)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generated Shakespeare from different order models\n",
    "\n",
    "Let's try to generate text based on different language-model orders. Let's start with something silly:\n",
    "\n",
    "### order 2:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fien;\n",
      "All the know, whou,\n",
      "Till an deed. I the lows hen he humman my forebtereered train musave thow'stiet its mans ist ove thim thet-lientsmintue suffed, welp.\n",
      "\n",
      "I hearle.\n",
      "Posem even pat his I'll mased scou me who, their youstur. She in\n",
      "MARMISTA:\n",
      "We an,\n",
      "Teas or gen me cand I kin unto scind me?\n",
      "\n",
      "HARISTRA:\n",
      "PERICK:\n",
      "\n",
      "FRA:\n",
      "Sir lewas murliumble\n",
      "Deat Trons wer!'\n",
      "\n",
      "Of th ass\n",
      "home, mand shoutim.\n",
      "\n",
      "Diegif al, athim; all; I'll her ne,\n",
      "And red Caeld man we's me,\n",
      "Whe unt heare poseche be. Yourayer!\n",
      "\n",
      "Fram min as th ang but;\n",
      "And dive, at brothen th mande thimaund hirstell broad\n",
      "th,\n",
      "SHARLAUDINE:\n",
      "I a pre en? Clou ne not, your not, shontlethy wit?\n",
      "\n",
      "Mare now th his dam, to for not.\n",
      "\n",
      "Fir.\n",
      "\n",
      "Viany.\n",
      "\n",
      "GO:\n",
      "Brou hence,\n",
      "Whaliparry\n",
      "HES:\n",
      "Gartize ito lestat but, willy,\n",
      "SIDARWICK:\n",
      "The cauld.\n",
      "\n",
      "OPARK:\n",
      "Your fathy at ster, yet or. Hal, to to cry beeld thappothince witere's ther.\n",
      "\n",
      "GUICK:\n",
      "Is allip yoned nothim.\n",
      "\n",
      "CLOCTAR:\n",
      "I boo sup if we not mure speads.\n",
      "\n",
      "DON:\n",
      "ISTERNENIO:\n",
      "Whou a not fought the spendon'tion.\n",
      "\n",
      "LEPHERICKLY:\n",
      "MENR\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(\"shakespeare_input.txt\", order=2)\n",
    "print generate_text(lm, 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Not so great.. but what if we increase the order to 4?\n",
    "\n",
    "### order 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "First Lord And ther:\n",
      "Third, he crown to restraitor, upon,\n",
      "I have you may patience;\n",
      "But I for I will feet his with me; I bless not promonth occasion of his\n",
      "knave!\n",
      "\n",
      "CATESBY:\n",
      "\n",
      "MAMILLO:\n",
      "Excellusion\n",
      "Hark!--Whose the name.\n",
      "\n",
      "SIR WALTER BLUNT:\n",
      "Now these this yours to me.\n",
      "\n",
      "HORATIO:\n",
      "O Regan.\n",
      "\n",
      "Clown:\n",
      "Conce,\n",
      "And sweariest a rowed in in the sith shall.\n",
      "\n",
      "SHALLOW:\n",
      "Nay, I wisdom\n",
      "Thy thoughts\n",
      "The whet of thought the greath;\n",
      "The criest, above\n",
      "Unto my people placed for bitter is no marry, answer master. Where confess me other full give wish.\n",
      "\n",
      "PRINCENTINE:\n",
      "Good consieur Le Benedicate, by three opposed by them is brother that vices from encourse!\n",
      "Is my knave pint\n",
      "Thy lord.\n",
      "\n",
      "FALSTAFF:\n",
      "My look about of the Romeo, he here in\n",
      "they done she neighbour\n",
      "active pat; and give is and truth a peace, I should nobody.\n",
      "\n",
      "PISANIO:\n",
      "Report\n",
      "We here her fough, and revoltingly. Reason that you, true,\n",
      "You hast the comest Rosalind up and know would a solicite,\n",
      "Betray, what comforth\n",
      "Live,\n",
      "Because the news?\n",
      "You urge\n",
      "That this, Satu\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(\"shakespeare_input.txt\", order=4)\n",
    "print generate_text(lm, 4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "First Citizens, make voice to my love meants, and turn the kingdom, sure,\n",
      "O'er wish.\n",
      "I'll twelves one that I, majesty come come and a dearer:\n",
      "'Tis to best orname, master\n",
      "thou will requestion of it, what not living is nothink.\n",
      "\n",
      "First Claudio, I wast delibert?\n",
      "rece of Marcius, sir, head of Venice. Well, my chambering that though token? prodigal, as Cato, and get not Nature;\n",
      "With and you; and my knew robbing come speak:\n",
      "I bear not, my hung Georget have my loved, to hides, mistre at not thou shake hath finds of that sheep-sheep stay;\n",
      "The king; and is ther's spoke them at.\n",
      "On, Macbething to their dry. Witch,\n",
      "Cannot murder father's back. Our dried, present she generald crowns, rounderers\n",
      "Are you son the good, as well thee, Hero, have he deligiously away\n",
      "A sooth done, to hand,\n",
      "And that some of danged slew,\n",
      "And all because of didst need with himselve board,\n",
      "And so? them. Nay, the hard-head\n",
      "To me ask of fear:\n",
      "Polixeness prince perceiving herbs: having and so,\n",
      "Ingration of we knighter:\n",
      "But with \n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(\"shakespeare_input.txt\", order=4)\n",
    "print generate_text(lm, 4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is already quite reasonable, and reads like English. Just 4 letters history! What if we increase it to 7?\n",
    "\n",
    "### order 7"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "First Citizens are odious,\n",
      "For all incite us to the thing.\n",
      "But, I pray thee, fellow of a tavern a most no revenge, thy behalf;\n",
      "And palmy state, she's warm!\n",
      "If they did was made after weep the drum toward do you know when the comfort. But indeed, his\n",
      "majestical, apish, shall I do? I would have forswore my cheese, and thoroughly penn'd\n",
      "And on our awe,\n",
      "Or been all that humour of his chief gone with you will come of truth,\n",
      "Pluck this All-Souls' day is yourself,\n",
      "No doubt it not be a sovereign, let them on Lud's-town march\n",
      "towards London,\n",
      "To sit undertake you may pare him! God sa' me, 'twere to thee,\n",
      "But dead, hereafter, drybeat them, gentlemen\n",
      "To slay your feast him,\n",
      "This head.\n",
      "Hark! he stirs:\n",
      "Start not trouble a\n",
      "lord. If it hath a power\n",
      "Meets with the middle.\n",
      "\n",
      "First Gentleman,\n",
      "accordant, her voice goes that make a man may still?\n",
      "\n",
      "TROILUS:\n",
      "O brave brought on kiss\n",
      "She vied so long. I have struck this particulars.\n",
      "\n",
      "SLENDER:\n",
      "\n",
      "DOCTOR CAIUS:\n",
      "Verily,\n",
      "Your dangerous unto yourselves.\n",
      "\n",
      "NORFOLK:\n",
      "Be a\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(\"shakespeare_input.txt\", order=7)\n",
    "print generate_text(lm, 7)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How about 10?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "First Citizen:\n",
      "Ay, sir; so should I, in these greens before you choose the rigour of severest law.\n",
      "\n",
      "Second Murderer:\n",
      "He needs as many, sir, as I say, to vex\n",
      "her I will or no?\n",
      "O, torture me in this plain, so many hollow falsehood!\n",
      "Why did he so? I charge ye, bear her fan!\n",
      "To see him in a dark hour. Resolve yourself:\n",
      "Nay, task me to my chance,\n",
      "Is queen of all this?\n",
      "\n",
      "BRUTUS:\n",
      "Could you on the hip,\n",
      "Abuse him to a better gone, so must thy grave,\n",
      "And give us notice of your duty throughly.\n",
      "Signior Benedick, the marriage: Lastly,\n",
      "If I do vow a friendship\n",
      "That young Prince John a full commission: I will\n",
      "on with my good cousin; farewell.\n",
      "And now, good Cassio!\n",
      "\n",
      "IAGO:\n",
      "Awake! what, have you served the place,\n",
      "And we'll consign to.\n",
      "\n",
      "KING LEAR:\n",
      "O, ho, are you hurt, lieutenant; and 'tis powerful grace than boy.\n",
      "\n",
      "DUKE:\n",
      "Where you shall hear music and see the bottom of the selfsame sun that makes that men should possess.\n",
      "\n",
      "KING:\n",
      "I am wrapp'd in fire,\n",
      "To burn the lodging out. Give him the revolt of thine was\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(\"shakespeare_input.txt\", order=10)\n",
    "print generate_text(lm, 10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### This works pretty well\n",
    "\n",
    "With an order of 4, we already get quite reasonable results. Increasing the order to 7 (~word and a half of history) or 10 (~two short words of history) already gets us quite passable Shakepearan text. I'd say it is on par with the examples in Andrej's post. And how simple and un-mystical the model is!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### So why am I impressed with the RNNs after all?\n",
    "\n",
    "Generating English a character at a time -- not so impressive in my view. The RNN needs to learn the previous $n$ letters, for a rather small $n$, and that's it. \n",
    "\n",
    "However, the code-generation example is very impressive. Why? because of the context awareness. Note that in all of the posted examples, the code is well indented, the braces and brackets are correctly nested, and even the comments start and end correctly. This is not something that can be achieved by simply looking at the previous $n$ letters. \n",
    "\n",
    "If the examples are not cherry-picked, and the output is generally that nice, then the LSTM did learn something not trivial at all.\n",
    "\n",
    "Just for the fun of it, let's see what our simple language model does with the linux-kernel code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2017-02-01 11:34:40--  http://cs.stanford.edu/people/karpathy/char-rnn/linux_input.txt\n",
      "Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64\n",
      "Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:80... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 6206996 (5.9M) [text/plain]\n",
      "Saving to: ‘linux_input.txt’\n",
      "\n",
      "linux_input.txt     100%[===================>]   5.92M  12.7MB/s    in 0.5s    \n",
      "\n",
      "2017-02-01 11:34:40 (12.7 MB/s) - ‘linux_input.txt’ saved [6206996/6206996]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget http://cs.stanford.edu/people/karpathy/char-rnn/linux_input.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "~~~~~~~/*\n",
      " * linux/kernel/irq/resend.c\n",
      " *\n",
      " *  Kernel Probes (KProbes)\n",
      " *\n",
      " * The reason not to enable\n",
      "\t * that is why we\n",
      "\t\t * are not going idle with sched tick stopped.\n",
      "\t\t *\n",
      "\t\t * However we can't #include <linux/module.h>\n",
      "#include <linux/smp.h>\n",
      "#include <linux/irq.h>\n",
      "#include <linux/rculist.h>\n",
      "#include <linux/types.h>\n",
      "#include <linux/reboot.h>\n",
      "#include <linux/kgdb.h>\n",
      "#include <linux/freezer.h>\n",
      "#include <linux/init.h>\n",
      "#include <linux/bitops.h>\n",
      "#include <linux/mutex.h>\n",
      "#endif\n",
      "\n",
      "#ifndef CONFIG_TIMER_STATE_MIGRATE;\n",
      "\t}\n",
      "}\n",
      "\n",
      "static void padata_replace(pinst, pd);\n",
      "\n",
      "\t\tcpumask_clear_cpu(cpu, tbl[node]);\n",
      "}\n",
      "\n",
      "static int msi_domain_info *info = filp->private_data;\n",
      "\t\tint cpu = (unsigned long get_symbol_pos(addr, NULL, &ftrace_traceoff,\n",
      "\t.print\t\t\t= ftrace_graph(&trace_graph_addr(trace->func)\n",
      "\t\t\treturn 0;\n",
      "\t\t/*\n",
      "\t\t * Tasks are prohibited area\n",
      "\t */\n",
      "\tlist_for_each_entry(mod, &modules, pos);\n",
      "}\n",
      "\n",
      "static int bpf_prog_types);\n",
      "}\n",
      "\n",
      "/**\n",
      " *\tcreate_mem_extents(struct smp_hotplug_thread rcu_cpu_kthread_stop(*tp);\n",
      "\t*tp =\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(\"linux_input.txt\", order=10)\n",
    "print generate_text(lm, 10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      " *     wait->flags &= ~WQ_FLAG_EXCLUSIVE;\n",
      "\tspin_lock_irqsave(&cpu_pm_notifier_lock, flags);\n",
      "\n",
      "out_unlock:\n",
      "\tread_unlock(&tasklist_lock);\n",
      "\t\t\tfor_each_process_thread(g, p) {\n",
      "\t\tif (rt_task(p) && task_group(p) == tg)\n",
      "\t\t\treturn 1;\n",
      "\t\treturn 0;\n",
      "\t}\n",
      "\n",
      "\tif (!param)\n",
      "\t\tgoto out_regex_unlock;\n",
      "\t}\n",
      "\n",
      "\tif (cmpxchg64(&scd->clock, 0, 0);\n",
      "#else\n",
      "\t/*\n",
      "\t * On 64bit the read of [my]scd->clock is atomic versus the\n",
      "\t * update on the remote cpu.\n",
      " *\n",
      " * returns 0 if\n",
      " *  - funcgraph-interrupts option is set\n",
      " *  - we are not inside the irq code, and this is not the last online CPU executing the function symbol\n",
      " * from the kallsyms table size */\n",
      "\tchar *knt1 = NULL;\n",
      "\n",
      "\tif (unlikely(prof_on == SLEEP_PROFILING;\n",
      "\t\tif (str[strlen(sleepstr))) {\n",
      "#ifdef CONFIG_KGDB_KDB\n",
      "#include <linux/spinlock.h>\n",
      "#include <linux/init.h>\n",
      "#include <linux/random.h>\n",
      "#include <linux/syscalls.h>\n",
      "#include <linux/export.h>\n",
      "\n",
      "void __raw_spin_lock_nested);\n",
      "\n",
      "void __lockfunc _raw_write_unlock_irq(&tasklist_lock);\n",
      "\tif (child->ptrace && child->parent = child->r\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(\"linux_input.txt\", order=15)\n",
    "print generate_text(lm, 15)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/*\n",
      " * linux/kernel/irq/spurious.c\n",
      " *\n",
      " * Copyright (C) 2009 Steven Rostedt <srostedt@redhat.com>\n",
      " * Copyright (C) 2002-2004 Timesys Corporation\n",
      " * Copyright (C) 2004 IBM Corporation\n",
      " *\n",
      " *  Author: Serge Hallyn <serue@us.ibm.com>\n",
      " *\n",
      " *  This program is free software; you can redistribute it and/or modify\n",
      " * it under the terms of the GNU General Public License version 2 as\n",
      " * published by the Free Software Foundation; version 2\n",
      " * of the License.\n",
      " */\n",
      "\n",
      "/*\n",
      " * CONFIG_LATENCYTOP enables a kernel latency tracking infrastructure that is\n",
      " * used by the \"latencytop\" userspace tool. The latency that is tracked is not\n",
      " * the 'traditional' interrupt latency (which is primarily caused by something\n",
      " * else consuming CPU), but instead, it is the latency an application encounters\n",
      " * because the kernel sleeps on its behalf for various reasons.\n",
      " *\n",
      " * This code tracks 2 levels of statistics:\n",
      " * 1) System level latency\n",
      " * 2) Per process latency\n",
      " *\n",
      " * The latency is stored in fixed sized data structures in a\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(\"linux_input.txt\", order=20)\n",
    "print generate_text(lm, 20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/*\n",
      " * linux/kernel/itimer.c\n",
      " *\n",
      " * Copyright (C) 1992 Darren Senn\n",
      " */\n",
      "\n",
      "/* These are all the functions necessary to implement itimers */\n",
      "\n",
      "#include <linux/perf_event.h>\n",
      "#include <linux/debugfs.h>\n",
      "#include <linux/mm.h>\n",
      "#include <linux/futex.h>\n",
      "#include <linux/rcupdate.h>\n",
      "#include <linux/interrupt.h>\n",
      "#include <linux/compiler.h>\n",
      "\n",
      "#include \"tick-internal.h\"\n",
      "\n",
      "/*\n",
      " * Tick devices\n",
      " */\n",
      "DEFINE_PER_CPU(u64, cpu_hardirq_time);\n",
      "DEFINE_PER_CPU(u64, cpu_hardirq_time);\n",
      "DECLARE_PER_CPU(unsigned int, rcu_cpu_kthread_loops);\n",
      "DEFINE_PER_CPU(char, rcu_cpu_has_work);\n",
      "\n",
      "#endif /* #ifdef CONFIG_RCU_BOOST\n",
      "\tseq_printf(m, \" kt=%d/%c ktl=%x\",\n",
      "\t\t   per_cpu(rcu_cpu_kthread_loops);\n",
      "\t\tlocal_irq_disable();\n",
      "\n",
      "\tpending = local_softirq_pending();\n",
      "}\n",
      "\n",
      "static void rcu_prepare_cpu(int cpu)\n",
      "{\n",
      "\tstruct ring_buffer *rb);\n",
      "extern struct ring_buffer *rb)\n",
      "{\n",
      "\treturn 0;\n",
      "}\n",
      "#endif /* CONFIG_RCU_USER_QS */\n",
      "\n",
      "/**\n",
      " * rcu_irq_exit - inform RCU that current CPU is entering idle\n",
      " *\n",
      " * Enter idle mode, in other words, leaving the mode in which read-\n"
     ]
    }
   ],
   "source": [
    "print generate_text(lm, 20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/*\n",
      " * linux/kernel/irq/proc.c\n",
      " *\n",
      " * Copyright 2003-2009 Red Hat, Inc.\n",
      " * Copyright (C) 2012 Rafael J. Wysocki <rjw@sisk.pl>\n",
      " *\n",
      " * This file is subject to the terms and conditions of the GNU General Public License\n",
      " * along with this program; if not, you can access it online at\n",
      " * http://www.gnu.org/licenses/gpl-2.0.html.\n",
      " *\n",
      " * Copyright IBM Corporation, 2008-2012\n",
      " * Authors:\n",
      " *\tSrikar Dronamraju\n",
      " *\tJim Keniston\n",
      " * Copyright (C) 2007-2008 Steven Rostedt <srostedt@redhat.com>\n",
      " * Copyright (C) 2003-2004 Amit S. Kale <amitkale@linsyssoft.com>\n",
      " * Copyright (C) 2007 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>\n",
      " *\n",
      " *  Interactivity improvements to CFS by Mike Galbraith\n",
      " *  2007-07-01  Group scheduling enhancements by Srivatsa Vaddagiri\n",
      " *  Copyright IBM Corp. 2009\n",
      " *    Author(s): Peter Oberparleiter <oberpar@linux.vnet.ibm.com>\n",
      " *\n",
      " *  Scaled math optimizations by Thomas Gleixner\n",
      " *  Copyright (C) 2004-2006 Tom Rini <trini@kernel.crashing.org>\n",
      " * Copyright (C) 2012 Red Hat, Inc. All Rights Reserved.\n",
      " * Copyright (c) 2009 Pavel Machek <pavel@ucw.cz>\n",
      " * Copyright (C) 2008, 2005\tIBM Corporation.\n",
      " * Copyright (c) 2009 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.\n",
      " *\n",
      " * This file is licensed under the terms of the GNU General Public License\n",
      "    along with this program; if not, write to the Free Software\n",
      " * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307  USA\n",
      " *\n",
      " * This code was copied from kernel/trace/trace_kprobe.c written by\n",
      " * Masami Hiramatsu <mhiramat@redhat.com>\n",
      " *\n",
      " *  - Added format output of fields of the trace point.\n",
      " *    This was based off of work by Tom Zanussi <tzanussi@gmail.com>\n",
      " */\n",
      "\n",
      "#include <linux/jiffies.h>\n",
      "#include <linux/freezer.h>\n",
      "#include <linux/seq_file.h>\n",
      "#include <linux/memcontrol.h>\n",
      "#include <linux/kgdb.h>\n",
      "#include <linux/cpuset.h>\n",
      "#include <linux/clocksource.h>\n",
      "#include <linux/tsacct_kern.h>\n",
      "#include <linux/device.h>\n",
      "#include <linux/vmalloc.h>\n",
      "#include <linux/rcupdate.h>\n",
      "#include <linux/init.h>\n",
      "#include <linux/spinlock.h>\n",
      "#include <linux/list.h>\n",
      "#include <linux/module.h>\n",
      "#include <linux/highmem.h>\n",
      "#include <asm/mmu_context.h>\n",
      "#include <linux/audit.h>\n",
      "\n",
      "#include <net/sock.h>\n",
      "#include \"audit.h\"\t/* audit_signal_info() */\n",
      "\n",
      "/*\n",
      " * SLAB caches for signal bits.\n",
      " */\n",
      "\n",
      "static struct kmem_cache *fs_cachep;\n",
      "\n",
      "/* SLAB cache for fs_struct structures (tsk->fs) */\n",
      "struct kmem_cache *delayacct_cache;\n",
      "\n",
      "static int __init snapshot_device_init(void)\n",
      "{\n",
      "\treturn sysfs_create_group(&mod->mkobj.kobj, &mod->mkobj.mp->grp);\n",
      "\tif (err)\n",
      "\t\tfree_module_param_attrs(struct module *mod, struct load_info *info)\n",
      "{\n",
      "\tstatic unsigned long\t\ttrace_buf_size = TRACE_BUF_SIZE_DEFAULT\t1441792UL /* 16384 * 88 (sizeof(entry)) */\n",
      "\n",
      "static unsigned long\t\tgdb_regs[(NUMREGBYTES +\n",
      "\t\t\t\t\tsizeof(unsigned long) unsigned long\n",
      " * values from/to the user buffer, treated as an ASCII string. The values\n",
      " * are treated as milliseconds, and converted to jiffies when they are stored.\n",
      " *\n",
      " * This routine will record that the cpu is going idle with tick stopped.\n",
      " * This info will be used in performing idle load balancing in the future.\n",
      " */\n",
      "void nohz_balance_enter_idle(cpu);\n",
      "\t\t\tcalc_load_enter_idle();\n",
      "\n",
      "\t\t\tts->last_tick = hrtimer_get_expires(timer), now);\n",
      "\t/* Return 0 only, when the timer is expired and not pending */\n",
      "\tif (remaining.tv64 <= 0) {\n",
      "\t\t/*\n",
      "\t\t * A single shot SIGEV_NONE timer must return 0, when\n",
      "\t\t * it is expired !\n",
      "\t\t */\n",
      "\t\tif ((timr->it_sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_SIGNAL))\n",
      "\t\treturn NULL;\n",
      "\n",
      "\treturn page_address(page);\n",
      "\n",
      "\trb_init_page(bpage->page);\n",
      "\t}\n",
      "\n",
      "\treturn 0;\n",
      "\n",
      "free:\n",
      "\tklp_free_objects_limited(patch, obj);\n",
      "\tkobject_put(&patch->kobj);\n",
      "}\n",
      "\n",
      "static int klp_enable_object(struct klp_object *obj)\n",
      "{\n",
      "\tstruct klp_func *func;\n",
      "\tint ret;\n",
      "\tconst char *name;\n",
      "\tbool gplok;\n",
      "\tbool warn;\n",
      "\n",
      "\t/* Output */\n",
      "\tstruct module *owner,\n",
      "\t\t\t\t    void *data),\n",
      "\t\t\t void *data)\n",
      "{\n",
      "\tstruct perf_sample_data *data,\n",
      "\t\t\t\tstruct pt_regs *regs,\n",
      "\t\t\t\t\t\t void *addr, void *dest)\n",
      "{\n",
      "\tint len;\n",
      "\tvoid __user *vaddr = (void __force __user *) addr;\n",
      "\n",
      "\tlen = strnlen_user(vaddr, MAX_STRING_SIZE);\n",
      "\n",
      "\tif (len == 0 || len > MAX_STRING_SIZE)  /* Failed to check length */\n",
      "\t\t*(u32 *)dest = 0;\n",
      "\telse\n",
      "\t\t*(u32 *)dest = len;\n",
      "}\n",
      "\n",
      "static unsigned long sync_rcu_preempt_exp_mutex.\n",
      " */\n",
      "static DEFINE_MUTEX(rt_constraints_mutex);\n",
      "\tput_online_cpus();\n",
      "}\n",
      "EXPORT_SYMBOL_GPL(irq_setup_generic_chip);\n",
      "\n",
      "/**\n",
      " * irq_setup_generic_chip);\n",
      "\n",
      "/**\n",
      " * irq_setup_generic_chip(struct irq_domain *domain, unsigned int irq_base,\n",
      "\t\t\t    unsigned long *gpnum, unsigned long *completed)\n",
      "{\n",
      "\tstruct rcu_state *rsp)\n",
      "{\n",
      "}\n",
      "\n",
      "static void increment_cpu_stall_ticks(void)\n",
      "{\n",
      "}\n",
      "\n",
      "#endif /* #else #ifdef CONFIG_RCU_TRACE */\n",
      "\n",
      "static unsigned int find_pcpusec(struct load_info *info,\n",
      "\t\t\t   struct kernel_param *kparam,\n",
      "\t\t\t   unsigned int nr_irqs)\n",
      "{\n",
      "\tint i;\n",
      "\n",
      "\tfor (i = 0; i < DBG_MAX_REG_NUM; i++) {\n",
      "\t\tdbg_set_reg(i, ptr + idx, regs);\n",
      "\t\tidx += dbg_reg_def[i].size);\n",
      "}\n",
      "\n",
      "/* Handle the 'p' individual regster set */\n",
      "static void gdb_cmd_setregs(struct kgdb_state *ks)\n",
      "{\n",
      "\t/*\n",
      "\t * We know that this packet is only sent\n",
      "\t * during initial connect.  So to be safe,\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print generate_text(lm, 20, nletters=5000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Order 10 is pretty much junk. In order 15 things sort-of make sense, but we jump abruptly between the \n",
    "and by order 20 we are doing quite nicely -- but are far from keeping good indentation and brackets. \n",
    "\n",
    "How could we? we do not have the memory, and these things are not modeled at all. While we could quite easily enrich our model to support also keeping track of brackets and indentation (by adding information such as \"have I seen ( but not )\" to the conditioning history), this requires extra work, non-trivial human reasoning, and will make the model significantly more complex. \n",
    "\n",
    "The LSTM, on the other hand, seemed to have just learn it on its own. And that's impressive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
